The MSR-Video to Text dataset with clean annotations

Authors

Abstract

Video captioning automatically generates short descriptions of video content, usually in the form of a single sentence. Many methods have been proposed for solving this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark for testing the performance of these methods. However, we found that the human annotations, i.e., the descriptions of video contents, are quite noisy: e.g., there are many duplicate captions, and many captions contain grammatical problems. These problems may pose difficulties for models learning the underlying patterns. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that, when trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips.
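As an illustration of one cleaning step mentioned in the abstract (removing duplicate captions), the following is a minimal Python sketch. It is not the authors' procedure: the `(video_id, caption)` pair layout, the function names `normalize` and `drop_duplicate_captions`, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions made for this example, and the sample captions are invented.

```python
from collections import defaultdict


def normalize(caption: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(caption.lower().split())


def drop_duplicate_captions(annotations):
    """Keep the first occurrence of each normalized caption per video.

    `annotations` is an iterable of (video_id, caption) pairs. This layout is
    an assumption for the sketch, not the MSR-VTT file format.
    """
    seen = defaultdict(set)  # video_id -> normalized captions already kept
    cleaned = []
    for video_id, caption in annotations:
        key = normalize(caption)
        # Skip empty captions and captions already kept for this video.
        if key and key not in seen[video_id]:
            seen[video_id].add(key)
            cleaned.append((video_id, caption))
    return cleaned


# Invented sample data, for illustration only.
sample = [
    ("video0", "a man is singing"),
    ("video0", "A man is  singing"),      # duplicate after normalization
    ("video0", "a man sings on stage"),
    ("video1", "a man is singing"),       # same text, different video: kept
]
cleaned_sample = drop_duplicate_captions(sample)
```

Duplicates are tracked per video rather than globally, so the same sentence may still describe two different clips. Grammatical cleaning, the other problem the abstract names, is not covered by this sketch.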

Similar resources

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language Supplementary Material

When organizing the Microsoft Research Video To Language challenge [1], we found that, in our previously released dataset [10], some sentences annotated by AMT workers are identical in one video clip or very similar in one category. Therefore, to control the quality of data and annotations, as well as the competitions, we removed those simple and duplicated sentences and replaced them with refi...

Reasoning with Text Annotations

With the emerging need for automation of business processes and the advent of semantic web it has become necessary that digital contents should be expressed not only in natural language, but also in a form that can be understood, interpreted and used by software agents, thus permitting them to find, share and integrate information more easily. Thus, Knowledge Representation and automated reason...

MSR-Asia at TREC-11 Video Track

The Media Computing Group of Microsoft Research Asia participated in all three tasks of the Video track of TREC-11: automatic Shot Boundary Determination, Semantic Feature Extraction, and Video Search. A robust shot detector was proposed. Systems for semantic feature extraction and video retrieval that integrate many of the group's recent research results are presented.

Viewing Temporal Video Annotations

Video is a complex information space that requires advanced navigational aids for effective browsing. The increasing number of temporal video annotations offers new opportunities to provide video navigation according to a user's needs. We present a novel video browsing interface called TAV (Temporal Annotation Viewing) that provides the user with a visual overview of temporal video annotations....

Journal

Journal title: Computer Vision and Image Understanding

Year: 2022

ISSN: 1090-235X, 1077-3142

DOI: https://doi.org/10.1016/j.cviu.2022.103581